67 research outputs found
E2E Refined Dataset
Although the well-known MR-to-text E2E dataset has been used by many
researchers, its MR-text pairs include many deletion/insertion/substitution
errors. Since such errors affect the quality of MR-to-text systems, they must
be fixed as much as possible. Therefore, we developed a refined dataset and
some python programs that convert the original E2E dataset into a refined
dataset.Comment: 4 page
Towards Machine Speech-to-speech Translation
There has been a good deal of research on machine speech-to-speech translation (S2ST) in Japan, and this article presents these and our own recent research on automatic simultaneous speech translation. The S2ST system is basically composed of three modules: large vocabulary continuous automatic speech recognition (ASR), machine text-to-text translation (MT) and text-to-speech synthesis (TTS). All these modules need to be multilingual in nature and thus require multilingual speech and corpora for training models. S2ST performance is drastically improved by deep learning and large training corpora, but many issues still still remain such as simultaneity, paralinguistics, context and situation dependency, intention and cultural dependency. This article presents current on-going research and discusses issues with a view to next-generation speech-to-speech translation.En Japón se han llevado a cabo muchas actividades de investigación acerca de la traducción automática del habla. Este artículo pretende ofrecer una visión general de dichas actividades y presentar las que se han realizado más recientemente. El sistema S2ST está formado básicamente por tres módulos: el reconocimiento automático del habla continua y de amplios vocabularios (Automatic Speech Recognition, ASR), la traducción automática de textos (Machine translation, MT) y la conversión de texto a voz (Text-to-Speech Synthesis, TTS). Todos los módulos deben ser plurilingües, por lo cual se requieren discursos y corpus multilingües para los modelos de formación. El rendimiento del sistema S2ST mejora considerablemente por medio de un aprendizaje profundo y grandes corpus formativos. Sin embargo, todavía hace falta tratar diversos aspectos, com la simultaneidad, la paralingüística, la dependencia del contexto y de la situación, la intención y la dependencia cultural. Por todo ello, repasaremos las actividades de investigación actuales y discutiremos varias cuestiones relacionadas con la traducción automática del habla de última generación.Al Japó s'han dut a terme moltes activitats de recerca sobre la traducció automàtica de la parla. Aquest article n'ofereix una visió general i presenta les activitats que s'han efectuat més recentment. El sistema S2ST es compon bàsicament de tres mòduls: el reconeixement automàtic de la parla contínua i de vocabularis extensos (Automatic Speech Recognition, ASR), la traducció automàtica de textos (Machine translation, MT) i la conversió de text a veu (Text-to-Speech Synthesis, TTS). Tots els mòduls han de ser plurilingües, per la qual cosa es requereixen discursos i corpus multilingües per als models de formació. El rendiment del sistema S2ST millora considerablement per mitjà d'un aprenentatge profund i de grans corpus formatius. Tanmateix, encara cal tractar diversos aspectes, com la simultaneïtat, la paralingüística, la dependència del context i de la situació, la intenció i la dependència cultural. Així, farem un repàs a les activitats de recerca actuals i discutirem diverses qüestions relacionades amb la traducció automàtica de la parla d'última generació
An Empirical Study of Mini-Batch Creation Strategies for Neural Machine Translation
Training of neural machine translation (NMT) models usually uses mini-batches
for efficiency purposes. During the mini-batched training process, it is
necessary to pad shorter sentences in a mini-batch to be equal in length to the
longest sentence therein for efficient computation. Previous work has noted
that sorting the corpus based on the sentence length before making mini-batches
reduces the amount of padding and increases the processing speed. However,
despite the fact that mini-batch creation is an essential step in NMT training,
widely used NMT toolkits implement disparate strategies for doing so, which
have not been empirically validated or compared. This work investigates
mini-batch creation strategies with experiments over two different datasets.
Our results suggest that the choice of a mini-batch creation strategy has a
large effect on NMT training and some length-based sorting strategies do not
always work well compared with simple shuffling.Comment: 8 pages, accepted to the First Workshop on Neural Machine Translatio
Beyond Vectors: Subspace Representations for Set Operations of Embeddings
In natural language processing (NLP), the role of embeddings in representing
linguistic semantics is crucial. Despite the prevalence of vector
representations in embedding sets, they exhibit limitations in expressiveness
and lack comprehensive set operations. To address this, we attempt to formulate
and apply sets and their operations within pre-trained embedding spaces.
Inspired by quantum logic, we propose to go beyond the conventional vector set
representation with our novel subspace-based approach. This methodology
constructs subspaces using pre-trained embedding sets, effectively preserving
semantic nuances previously overlooked, and consequently consistently improving
performance in downstream tasks
- …